Abstract

We propose a generalized version of the Dantzig selector. We show
that it satisfies sparsity oracle inequalities in prediction and estimation.
We then consider the particular case of model selection in high-dimensional
linear regression with the Huber loss function. In this case we derive
the sup-norm convergence rate and the sign concentration property of the
Dantzig estimators under a mutual coherence assumption on the dictionary.
1 Introduction

Let $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ be a measurable space. We observe a set of $n$ i.i.d. random
pairs $Z_i = (X_i, Y_i)$, $i = 1, \dots, n$, where $X_i \in \mathcal{X}$ and $Y_i \in \mathcal{Y}$. Denote by $P$ the
joint distribution of $(X_i, Y_i)$ on $\mathcal{X} \times \mathcal{Y}$, and by $P^X$ the marginal distribution
of $X_i$. Let $Z = (X, Y)$ be a random pair in $\mathcal{Z}$ distributed according to $P$.
For any real-valued function $g$ on $\mathcal{X}$, define

$$\|g\|_\infty = \operatorname*{ess\,sup}_{x \in \mathcal{X}} |g(x)|, \qquad \|g\| = \Big(\int_{\mathcal{X}} g(x)^2 \, P^X(dx)\Big)^{1/2}, \qquad \|g\|_n = \Big(\frac{1}{n}\sum_{i=1}^n g(X_i)^2\Big)^{1/2}.$$

Let $\mathcal{D} = \{f_1, \dots, f_M\}$ be a set of real-valued functions on $\mathcal{X}$ called the dictionary, where $M \ge 2$. We
assume that the functions of the dictionary are normalized, so that $\|f_j\| = 1$
for all $j = 1, \dots, M$. We also assume that $\|f_j\|_\infty \le L$ for some $L > 0$. For any
$\theta \in \mathbb{R}^M$, define $f_\theta = \sum_{j=1}^M \theta_j f_j$ and $J(\theta) = \{j : \theta_j \ne 0\}$. Let $M(\theta) = |J(\theta)|$
be the cardinality of $J(\theta)$ and $\operatorname{sign}(\theta) = (\operatorname{sign}(\theta_1), \dots, \operatorname{sign}(\theta_M))^T$ where

$$\operatorname{sign}(t) = \begin{cases} 1 & \text{if } t > 0, \\ 0 & \text{if } t = 0, \\ -1 & \text{if } t < 0. \end{cases}$$

For any vector $\theta \in \mathbb{R}^M$ and any subset $J$ of $\{1, \dots, M\}$, we denote by $\theta_J$ the
vector in $\mathbb{R}^M$ which has the same coordinates as $\theta$ on $J$ and zero coordinates on
the complement $J^c$ of $J$. For any integers $1 \le d, p < \infty$ and $w = (w_1, \dots, w_d) \in \mathbb{R}^d$,
the $l_p$ norm of the vector $w$ is denoted by

$$|w|_p = \Big(\sum_{j=1}^d |w_j|^p\Big)^{1/p}, \qquad \text{and} \qquad |w|_\infty = \max_{1 \le j \le d} |w_j|.$$
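The notation above can be made concrete with a short sketch. It is only an illustration: the function names are hypothetical, vectors are plain Python lists, and indices are 0-based rather than 1-based as in the text.

```python
import math

def support(theta):
    """J(theta): set of indices of the nonzero coordinates (0-based here)."""
    return {j for j, t in enumerate(theta) if t != 0}

def sparsity(theta):
    """M(theta): cardinality of the support J(theta)."""
    return len(support(theta))

def sign(t):
    """sign(t) in {-1, 0, 1}, with sign(0) = 0 as in the text."""
    return (t > 0) - (t < 0)

def sign_vector(theta):
    """Coordinate-wise sign vector sign(theta)."""
    return [sign(t) for t in theta]

def restrict(theta, J):
    """theta_J: same coordinates as theta on J, zero on the complement of J."""
    return [t if j in J else 0.0 for j, t in enumerate(theta)]

def lp_norm(w, p):
    """l_p norm of w for 1 <= p < infinity, or the sup-norm when p is math.inf."""
    if p == math.inf:
        return max(abs(x) for x in w)
    return sum(abs(x) ** p for x in w) ** (1.0 / p)
```

For instance, for the vector (0, -2, 3, 0) the support has two elements, the sign vector is (0, -1, 1, 0), and the $l_1$ and sup norms are 5 and 3.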

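Consider a function $\gamma : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}_+$ such that for any $y$ in $\mathcal{Y}$ and $u, u'$ in $\mathbb{R}$
we have

$$|\gamma(y,u) - \gamma(y,u')| \le |u - u'|.$$

We assume furthermore that $\gamma(y, \cdot)$ is convex and differentiable for any $y \in \mathcal{Y}$.
We assume that for any $y \in \mathcal{Y}$ the derivative $\partial_u \gamma(y, \cdot)$ is absolutely continuous.
Then $\partial_u \gamma(y, \cdot)$ admits a derivative almost everywhere, which we denote by
$\partial_u^2 \gamma(y, \cdot)$. Consider the loss function $Q : \mathcal{Z} \times \mathbb{R}^M \to \mathbb{R}_+$ defined by

$$Q(z, \theta) = \gamma(y, f_\theta(x)). \qquad (1)$$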

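The expected and empirical risk measures at point $\theta$ in $\mathbb{R}^M$ are defined
respectively by

$$R(\theta) = \mathbb{E}(Q(Z, \theta)),$$

where $\mathbb{E}$ is the expectation sign, and

$$R_n(\theta) = \frac{1}{n} \sum_{i=1}^n Q(Z_i, \theta).$$

Define the target vector as a minimizer of $R(\cdot)$ over $\mathbb{R}^M$:

$$\theta^* = \arg\min_{\theta \in \mathbb{R}^M} R(\theta).$$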

Note that the target vector is not necessarily unique. From now on, we assume
that there exists an $s$-sparse solution $\theta^*$, i.e., a solution with $M(\theta^*) \le s$, and that
this sparse solution is unique. We will see that this is indeed the case under the
coherence condition on the dictionary (cf. Section 3 below).
Define the excess risk of the vector $\theta$ by

$$\mathcal{E}(\theta) = R(\theta) - R(\theta^*),$$

and its empirical version by

$$\mathcal{E}_n(\theta) = R_n(\theta) - R_n(\theta^*).$$

Our goal is to derive sparsity oracle inequalities for the excess risk and for the
risk of $\theta^*$ in the $l_1$ norm and in the sup-norm.

We consider the following minimization problem:

$$\min_{\theta \in \Theta} |\theta|_1 \quad \text{subject to} \quad |\nabla R_n(\theta)|_\infty \le r, \qquad (2)$$

where $\nabla R_n = (\partial_{\theta_1} R_n, \dots, \partial_{\theta_M} R_n)^T$, $r > 0$ is a tuning parameter defined later
and $\Theta$ is a convex subset of $\mathbb{R}^M$ specified later. Solutions of (2), if they exist,
will be taken as estimators of $\theta^*$. Note that we will prove in Lemma 3 that under
Assumption 2 the set $\{\theta \in \Theta : |\nabla R_n(\theta)|_\infty \le r\}$ is non-empty with probability
close to one. Note also that in the applications considered in Section 3, the
constraint $|\nabla R_n(\theta)|_\infty \le r$ can be defined as a system of inequalities involving
convex functions. Thus, solutions to (2) exist and can be efficiently computed
via convex optimization. In particular, for the regression model with the Huber
loss, the gradient $\nabla R_n(\theta)$ is piecewise linear so that (2) reduces in this case to
a standard linear programming problem. Denote by $\hat\Theta$ the set of all solutions of
(2). For the reasons above, we assume from now on that $\hat\Theta \ne \emptyset$ with probability
close to one.

The definition of our estimator (2) can be motivated as follows. Since the loss
function $Q(z, \cdot)$ is convex and differentiable for any fixed $z \in \mathcal{Z}$, the expected risk
$R$ is also a convex function of $\theta$ and it is differentiable under mild conditions.
Thus, minimizing $R$ is equivalent to finding the zeros of $\nabla R$. The quantity
$\nabla R_n(\theta)$ is the empirical version of $\nabla R(\theta)$. We choose the constant $r$ such that
the vector $\theta^*$ satisfies the constraint $|\nabla R_n(\theta^*)|_\infty \le r$ with probability close to
1. Then among all the vectors satisfying this constraint, we choose those with
minimum $l_1$ norm. Note that if we consider the linear regression problem with
the quadratic loss, we recognize in (2) the Dantzig minimization problem of
Candes and Tao [7]. From now on, we will call (2) the generalized Dantzig
minimization problem.
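To illustrate the constraint in the generalized Dantzig problem, here is a minimal sketch that evaluates the gradient of the empirical risk and tests feasibility of a candidate vector, taking the Huber loss of Section 3.1 as a concrete case. The design layout `F[i][j]` $= f_j(X_i)$, the function names, and the single threshold parameter `c` (standing for $2K + \alpha$) are illustrative assumptions; the sketch does not solve the minimization itself.

```python
def huber_derivative(x, c):
    """Derivative of the Huber function with threshold c: x clipped to [-c, c]."""
    return max(-c, min(c, x))

def f_theta(theta, feats):
    """f_theta(x) = sum_j theta_j f_j(x), with feats[j] = f_j(x)."""
    return sum(t * f for t, f in zip(theta, feats))

def grad_empirical_risk(theta, F, y, c):
    """Gradient of R_n(theta) = (1/n) sum_i phi(y_i - f_theta(x_i)).

    F[i][j] holds f_j(x_i); the j-th partial derivative is
    -(1/n) sum_i phi'(y_i - f_theta(x_i)) f_j(x_i).
    """
    n, M = len(F), len(F[0])
    grad = [0.0] * M
    for feats, yi in zip(F, y):
        d = huber_derivative(yi - f_theta(theta, feats), c)
        for j in range(M):
            grad[j] -= d * feats[j] / n
    return grad

def satisfies_dantzig_constraint(theta, F, y, c, r):
    """Check the sup-norm constraint |grad R_n(theta)|_inf <= r."""
    return max(abs(g) for g in grad_empirical_risk(theta, F, y, c)) <= r
```

Because the derivative of the Huber function is a clipped linear function of the residual, each coordinate of this gradient is piecewise linear in $\theta$, which is the property behind the linear programming remark above.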

Bickel et al. [1], Candes and Tao [7] and Koltchinskii [12] proved that the
Dantzig estimator performs well in high-dimensional regression problems with
the quadratic loss. In particular they proved sparsity oracle inequalities on the
excess risk and the estimation of $\theta^*$ for the $l_p$ norm with $1 \le p \le 2$.

The problem (2) is closely related to the minimization problem:

$$\min_{\theta \in \Theta} R_n(\theta) + r|\theta|_1, \qquad (3)$$

which is a generalized version of the Lasso. For the Lasso estimator, Bunea et
al. [5] proved similar results in high-dimensional regression problems with the
quadratic loss under a mutual coherence assumption [8] and Bickel et al. [1] under
a weaker Restricted Eigenvalue assumption. Koltchinskii [11] derived similar
results for the Lasso in the context of high-dimensional regression with twice
differentiable Lipschitz continuous loss functions under a restricted isometry
assumption. Van de Geer [22, 23] obtained similar results for the Lasso in the
context of generalized linear models with Lipschitz continuous loss functions.
Lounici [15] derived sup-norm convergence rates and sign consistency of the
Lasso and Dantzig estimators in a high-dimensional linear regression model
with the quadratic loss under a mutual coherence assumption.

The paper is organized as follows. In Section 2 we derive sparsity oracle
inequalities for the excess risk and for estimation of $\theta^*$ for the generalized Dantzig
estimators defined by (2) in a stochastic optimization framework. In Section 3
we apply the results of Section 2 to the linear regression model with the Huber
loss and to the logistic regression model. In Section 4 we prove the variable
selection consistency with rates under a mutual coherence assumption for the
linear regression model with the Huber loss. In Section 5 we show a sign
concentration property of the thresholded generalized Dantzig estimators for the
linear regression model with the Huber loss.



2 Sparsity oracle inequalities for prediction and
estimation with the $l_1$ norm

We need an assumption on the dictionary to derive prediction and estimation
results for the generalized Dantzig estimators. We first state the Restricted
Eigenvalue assumption [1].

Assumption 1.

$$\zeta(s) = \min_{J \subset \{1,\dots,M\}:\, |J| \le s} \;\; \min_{\Delta \ne 0:\, |\Delta_{J^c}|_1 \le |\Delta_J|_1} \frac{\|f_\Delta\|}{|\Delta_J|_2} > 0.$$

It implies an "equivalence" between the two norms $|\Delta|_2$ and $\|f_\Delta\|$ on the subset
$\{\Delta \ne 0 : |\Delta_{J(\Delta)^c}|_1 \le |\Delta_{J(\Delta)}|_1\}$ of $\mathbb{R}^M$.

We need the following assumption on $\|f_{\theta^*}\|_\infty$.

Assumption 2. There exists a constant $K > 0$ such that $\|f_{\theta^*}\|_\infty \le K$.

From now on we take for $\Theta$ the set

$$\Theta = \{\theta \in \mathbb{R}^M : \|f_\theta\|_\infty \le K\}.$$

The following assumption is a version of the margin condition (cf. [21]). It
links the excess risk to the functional norm $\|\cdot\|$.

Assumption 3. There exists a constant $c > 0$, depending possibly
on $K$, such that for any $\theta \in \Theta$

$$\|f_\theta - f_{\theta^*}\| \le c\,(R(\theta) - R(\theta^*))^{1/\kappa},$$

where $1 < \kappa \le 2$.

We will prove in Section 3 below that this condition is always satisfied with
the constant $\kappa = 2$ for the regression model with Huber loss and for the logistic
regression model. We also need the following technical assumption.

Assumption 4. The constants K and L satisfy 

1 s$ K, L ^ 

Define the quantity 



log M 



AV2L^ + 2V-J^. (4) 
n V n 
We assume from now on that $\tilde r \le 1$.

The main results of this section are the following sparsity oracle inequalities
for the excess risk and for estimation of $\theta^*$ in the $l_1$ norm. Define

$$r = 6\|\partial_u \gamma\|_\infty \tilde r. \qquad (5)$$

Theorem 1. Let Assumptions^ be satisfied. Take r as in ([5p. Assume that 
$M(\theta^*) \le s$. Then, with probability at least $1 - M^{-1} - M^{-K} - 3M^{-2K}\log\frac{n}{\log M}$,
we have 

2(l + 2K)cr^\^ , 



eee 

and 



sup£(0) -i / v + 12||9„7|| (6) 



s) y k - 1 



B " " r 11 4 ( w ) *"' ((1 + 2 * ,r) * + F^W (7) 

Note that the regularization parameter $r$ does not depend on the variance
of the noise if we consider the regression model with non-quadratic loss. In this
case, the use of Lipschitz losses enables us to treat cases where the noise variable
does not admit a finite second moment, e.g., the Cauchy distribution. The price
to pay is that we need to assume that $\|f_{\theta^*}\|_\infty \le K$ with known $K$.

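Proof. For any $\hat\theta \in \hat\Theta$ define $\Delta = \hat\theta - \theta^*$. We have

$$\mathcal{E}(\hat\theta) \le \mathcal{E}_n(\hat\theta) + \mathcal{E}(\hat\theta) - \mathcal{E}_n(\hat\theta) = \mathcal{E}_n(\hat\theta) + \frac{\mathcal{E}(\hat\theta) - \mathcal{E}_n(\hat\theta)}{|\Delta|_1 + \tilde r}\,(|\Delta|_1 + \tilde r). \qquad (8)$$

By Lemma 1 it holds on an event $A_1$ of probability at least $1 - M^{-K} - 3M^{-2K}\log\frac{n}{\log M}$ that

$$\sup_{\theta \in \Theta} \frac{|\mathcal{E}(\theta) - \mathcal{E}_n(\theta)|}{|\theta - \theta^*|_1 + \tilde r} \le 2Kr. \qquad (9)$$

For any $\hat\theta \in \hat\Theta$, we have by definition of the Dantzig estimator that $|\hat\theta|_1 \le |\theta^*|_1$. Thus

$$|\Delta_{J(\theta^*)^c}|_1 = \sum_{j \in J(\theta^*)^c} |\hat\theta_j| \le \sum_{j \in J(\theta^*)} \big(|\theta^*_j| - |\hat\theta_j|\big) \le |\Delta_{J(\theta^*)}|_1. \qquad (10)$$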

Define the function $g : t \mapsto R_n(\theta^* + t\Delta)$. Clearly $g$ is convex and differentiable
on $[0, 1]$. Thus, its derivative $g'(t) = \nabla R_n(\theta^* + t\Delta)^T \Delta$ is nondecreasing on $[0, 1]$.
The constraint $|\nabla R_n(\hat\theta)|_\infty \le r$ in (2) and Lemma 3
yield, on an event $A_2$ of probability at least $1 - M^{-1}$,

$$\mathcal{E}_n(\hat\theta) = R_n(\hat\theta) - R_n(\theta^*) = \int_0^1 \nabla R_n(\theta^* + t\Delta)^T \Delta \, dt \le r|\Delta|_1. \qquad (11)$$

Combining (8)-(11) yields that on the event $A_1 \cap A_2$

$$\mathcal{E}(\hat\theta) \le (2 + 4K)\, r\, |\Delta_{J(\theta^*)}|_1 + 12\|\partial_u \gamma\|_\infty K \tilde r^2. \qquad (12)$$



Next, 



2(l + 2A>|A J(fl . ) |i 2(l + 2X)rVi|A J(e , ) | 2 

2(l + 2X)cr^||/ A || 



€ — 



as) 

2cry / s 



)'■ 













1 /'2(l + 2if)cr > /s 
COO 



: £(0, (13) 



where we have used the Cauchy-Schwarz inequality in the first line, the inequality
$xy \le |x|^\kappa/\kappa + |y|^{\kappa'}/\kappa'$ that holds for any $x, y$ in $\mathbb{R}$ and for any $\kappa, \kappa'$ in $(1, \infty)$
such that $1/\kappa + 1/\kappa' = 1$ in the third line, and Assumption 2 in the last line.
Combining (12) and (13) and the fact that $\tilde r \le 1$ yields the first inequality. The
second inequality is a consequence of (6) and (13). □



1. 



We state and prove below intermediate results used in the proof of Theorem 



Lemma 1. Let Assumptions^ and^be satisfied. Then, with probability at least 
1 - M~ K - MI- 2K log j^j, we have 



\£(9)-£ n (9)\ 
see |0-0*|i + r 

where r is defined in Theorem 1. 

Proof. For any $\Delta > 0$, define the random variable

$$T_\Delta = \sup_{\theta \in \Theta:\, |\theta - \theta^*|_1 \le \Delta} |\mathcal{E}_n(\theta) - \mathcal{E}(\theta)|.$$

For any 9 in and (x, y) in Z we have 

\j(y, f e {x)) - 7(2/, / 9 .(*))| < R7IU (L|0 - 1 



(14) 



A2K), 







and 



e (| 7 (r, f e (x)) 7 (y, (x))| 2 ) < \\d u7 \\l (\e - <r|? a 2^ 2 ) . 

Assumption 3 and Bousquet's concentration inequality (cf. Theorem 4 in 
Section 6 below) with x = (A V 27T) log M, c = 2||<9„ 7 || 00 (AL A 2K) and a = 
V^Wdu-fWooiA A V2K) yield 

$$\mathbb{P}\big(T_\Delta > \mathbb{E}(T_\Delta) + 2\Delta K \|\partial_u \gamma\|_\infty \tilde r\big) \le M^{-(2K) \vee \Delta}.$$

We study now the quantity E(T A ). By standard symmetrization and contraction 
arguments (cf. Theorems 5 and 6 in Section 6) we obtain 



E(T A ) < 4||d tl7 || 00 E sup 

\6ee:\e-e*\ 1 ^A 



1 " \ 



Then, observe that the mapping $u \mapsto \frac{1}{n}\sum_{i=1}^n \epsilon_i f_u(X_i)$ is linear, thus its supremum
on a simplex is attained at one of its vertices. This yields
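The vertex argument can be checked on a small example. The sketch below is an illustration under stated assumptions: it works with the $l_1$ ball of a given radius (whose vertices are the signed, scaled coordinate vectors) and a generic coefficient vector `a`, which stands for the coefficients of the linear map; the function names are hypothetical.

```python
def dot(a, u):
    """Value of the linear map u -> sum_j a_j u_j."""
    return sum(x * v for x, v in zip(a, u))

def linear_sup_over_l1_ball(a, radius):
    """Supremum of the linear map over {u : |u|_1 <= radius}: radius * |a|_inf."""
    return radius * max(abs(x) for x in a)

def maximizing_vertex(a, radius):
    """A vertex (+/- radius) * e_j of the l_1 ball at which the supremum is attained."""
    j = max(range(len(a)), key=lambda k: abs(a[k]))
    vertex = [0.0] * len(a)
    vertex[j] = radius if a[j] >= 0 else -radius
    return vertex
```

With `a = [0.5, -2.0, 1.0]` and radius 3, the supremum equals 6 and is attained at the vertex (0, -3, 0), whereas a non-vertex point of the same $l_1$ norm such as (1, -1, 1) only gives 3.5. This is why the supremum over $\theta$ reduces to a maximum over the $M$ dictionary elements.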



E(T A ) < 4||d„ 7 |j 00 AE max 



n \ 
i=l / 



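Combining Assumption 4 and Lemma 2 we obtain

$$\mathbb{E}(T_\Delta) \le 4\|\partial_u \gamma\|_\infty \Delta \tilde r.$$

Thus

$$\mathbb{P}\big(T_\Delta > 6\Delta K \|\partial_u \gamma\|_\infty \tilde r\big) \le M^{-(2K) \vee \Delta}.$$

Define the following subsets of $\Theta$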

e-e*\t <f>, 



6(7) = {6» 6 e 
9(77) = {6» g 9 
9(777) = {6» g 9 

For any t > define the probabilities 



f < 161-61*1! ^2K}. 
{6-6*^ > 2K}. 



Pr = 



P, 



ii 



Pin 

For any t > we have 



sup ™-f"™ >t) 
K eee(i) \O-0*\i+r J 

K ee0(ii) \0~0*\i+r " ^ 

' sup i£(g)-gn(g)i ^; 

,eee(//7) |0-0*|i+r 



(15) 
Now, we bound from above the three probabilities on the right hand side of the
above expression. Take $t = 12\|\partial_u \gamma\|_\infty K \tilde r$. Applying (15) to $P_I$ yields that

$$P_I \le \mathbb{P}\big(T_{\tilde r} \ge 6\|\partial_u \gamma\|_\infty K \tilde r^2\big) \le M^{-2K},$$

since we have $\tilde r < K$ by Assumption 4.
Consider now $P_{II}$. We have

jo 

Q(II) C |J {9 G 6 : A,- +1 ^ |0 - < A,} , 

3=0 

where Aj = 2 3 K, j = 0, . . . , jo and jo is chosen such that 2 1 ~ : > K > r and 
2~i°K < r. Thus 

jo 

P// < J2 F ( Ta J > W\\8u'Y\\a A j+ iKr) 
j=o 

jo 

3=0 

< (jo + l)M- 2 ^ 
* ,3,1 ° g lolM ] - l]M ^ 



Consider finally P///. We have 



6(7/7) c (J {6e 6 : < |0-0*|i < A,-} 

3=0 

where A 3 = 2 1+ iR, j ^ 0. Thus 

OO 

Phi < H P ( T ^ ^ 12||a„7lloo^-x-K'f 
3=1 
io 

3=0 

OO 

3=1 



□ 



We now study the quantity $\mathbb{E}\big(\max_{1 \le j \le M} \big|\frac{1}{n}\sum_{i=1}^n \epsilon_i f_j(X_i)\big|\big)$. This is done
in the next lemma.

Lemma 2. We have

$$\mathbb{E}\Big(\max_{1 \le j \le M} \Big|\frac{1}{n}\sum_{i=1}^n \epsilon_i f_j(X_i)\Big|\Big) \le \tilde r, \qquad (16)$$

where $\tilde r$ is defined in (4).

Proof. Define the random variables 



Uj = 

in 

i=l 



1 ™ 

The Bernstein inequality yields, for any j = 1, . . . , M and t > 0, 

P(m>l) ^(- »HB) )' (17 » 

Set 6j = \\fj\\oc/(3\/n). Define the random variables Tj = Uj^y.^f.p^. 
and Tj = UjTL^Yji^WfjP/bj- F° r all i > we have 

PflTjl > t) < 2exp (-^ , P (|Tj| > i) < 2cxp (-^p) ■ 

Define the function h v (x) — exp(x ,ly ) — 1, where v > 0. This function is clearly 
convex for any v > 0. We have 



E fa 



f\Tj 



V 126, 



/"OC 

/ e'PQTjl > 12bjt)dt < 1, 
./o 



where we have used Fubini's Theorem in the first equality. Since the function 
hi is convex and nonnegative, we have 

hi I E I max J^p- ) ) sC E ( hi ( max 



i<j^M 126,y / V Vi<i<-w 126^ 

A/ 



M, 

where we have used the Jensen inequality. Since the function h± (x) = log(l+x) 
is increasing, we have 

e( max < log(l + Af) 

\i<j<M 126j/ 

fif max |T,A < 4 log(l + M) ^ 
Applying the same argument to the function $h_2$, we prove that



E max |T'| sC 2y/3\/\og(l + M) max ||/,||. (19) 



Combining fl8|) and fl9|) yields the result. 

□ 

Lemma 3. Let Assumptions® and\Qbe satisfied. Then, with probability at least 
$1 - M^{-1}$, we have

$$|\nabla R_n(\theta^*)|_\infty \le r,$$

where $r$ is defined in Theorem 1.
Proof. For any 1 ^ j ^ M define 

1 " 

i=l 

Since the function 6* — > 7(1/, fe(x)) is differentiable w.r.t. and |9„7(y, /e^))/?'^)! ^ 
||9 u 7||oo-f' for any (x, y) € X x y and 9 e M M , we have 

Next, similarly as in Lemmas [1] and [2 we prove that 

EdViUmoo) «S 4||a u 7|| 00 f. 

Finally Bousquet's concentration inequality (cf. Theorem 4 in Section 6 below) 
yields that, with probability at least 1 — M _1 , 

|Vi? n (0*)|oo <E(|VJ2n(0*)|oo) 

+ (lIML + 2||9 u7 || 0C XE(|Vi?„(^)| 0O ) 

3n 

< 6||9„7j| oc f. 

□ 

3 Examples 

3.1 Robust regression with the Huber loss 

We consider the linear regression model

$$Y = f_{\theta^*}(X) + W, \qquad (20)$$

where $X \in \mathbb{R}^d$ is a random vector, $W \in \mathbb{R}$ is a random variable independent of
$X$ whose distribution is symmetric w.r.t. 0 and $\theta^* \in \mathbb{R}^M$ is the unknown vector
of parameters. Consider the function

$$\phi(x) = \frac{1}{2}x^2\,\mathbf{1}_{\{|x| \le 2K+\alpha\}} + \Big((2K+\alpha)|x| - \frac{(2K+\alpha)^2}{2}\Big)\mathbf{1}_{\{|x| > 2K+\alpha\}},$$

where $\alpha > 0$. The Huber loss function is defined by

$$Q(z, \theta) = \phi(y - f_\theta(x)), \qquad (21)$$

where $z = (x, y) \in \mathbb{R}^d \times \mathbb{R}$ and $\theta \in \Theta$.
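As a concrete reference, the Huber function and its derivative can be written as follows. This is a minimal sketch in which the threshold `c` stands for $2K + \alpha$; the function names and the scalar `prediction` argument (standing for $f_\theta(x)$) are illustrative choices.

```python
def huber(x, c):
    """Huber function with threshold c: quadratic on [-c, c], linear outside."""
    if abs(x) <= c:
        return 0.5 * x * x
    return c * abs(x) - 0.5 * c * c

def huber_prime(x, c):
    """Derivative of the Huber function: x clipped to [-c, c], so it is bounded by c."""
    return max(-c, min(c, x))

def huber_loss(y, prediction, K, alpha):
    """Q(z, theta) = phi(y - f_theta(x)), with threshold 2K + alpha."""
    return huber(y - prediction, 2.0 * K + alpha)
```

The two branches agree at $|x| = c$ (both give $c^2/2$), so the function is continuously differentiable, and the derivative is odd and bounded by $c$ in absolute value, so the loss is Lipschitz with constant $c$.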

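In the following lemma we prove that for this loss function Assumption 3 is
satisfied with $\kappa = 2$ and $c = (2/\mathbb{P}(|W| \le \alpha))^{1/2}$.

Lemma 4. Let $Q$ be defined by (21). Then for any $\theta \in \Theta$ we have

$$\frac{\mathbb{P}(|W| \le \alpha)}{2}\,\|f_\theta - f_{\theta^*}\|^2 \le \mathcal{E}(\theta).$$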

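Proof. Set $\Delta = \theta - \theta^*$. Since $\phi'$ is absolutely continuous, we have for any $\theta \in \Theta$

$$Q(Z, \theta) - Q(Z, \theta^*) = \phi'(W) f_{-\Delta}(X) + f_\Delta(X)^2 \int_0^1 \mathbf{1}_{\{|W + t f_{-\Delta}(X)| \le 2K+\alpha\}} (1 - t)\, dt$$

$$\ge \phi'(W) f_{-\Delta}(X) + \frac{1}{2}\,\mathbf{1}_{\{|W| \le \alpha\}}\, f_\Delta(X)^2,$$

since $\|f_\theta\|_\infty \le K$ for any $\theta \in \Theta$. Taking the expectations we get

$$R(\theta) - R(\theta^*) \ge \frac{\mathbb{P}(|W| \le \alpha)}{2}\,\|f_\Delta\|^2,$$

for any $\alpha > 0$ since $\phi'$ is odd and the distribution of $W$ is symmetric w.r.t.
0. □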
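The pointwise inequality used in this proof can be checked numerically. The sketch below evaluates, on a grid, the gap $\phi(w + h) - \phi(w) - \phi'(w) h - \frac{1}{2}\mathbf{1}_{\{|w| \le \alpha\}} h^2$ for $|h| \le 2K$, where $h$ plays the role of $f_{-\Delta}(X)$ and $w$ the role of $W$. The grid ranges and function names are illustrative assumptions, and a grid check is of course not a proof.

```python
def huber(x, c):
    """Huber function with threshold c."""
    return 0.5 * x * x if abs(x) <= c else c * abs(x) - 0.5 * c * c

def huber_prime(x, c):
    """Derivative of the Huber function (clipping to [-c, c])."""
    return max(-c, min(c, x))

def lower_bound_gap(w, h, K, alpha):
    """phi(w+h) - phi(w) - phi'(w) h - 0.5 * 1{|w| <= alpha} * h^2, with c = 2K + alpha."""
    c = 2.0 * K + alpha
    indicator = 1.0 if abs(w) <= alpha else 0.0
    return huber(w + h, c) - huber(w, c) - huber_prime(w, c) * h - 0.5 * indicator * h * h

def min_gap_on_grid(K, alpha, steps=40):
    """Smallest gap over a grid of w in [-3c, 3c] and h in [-2K, 2K]."""
    c = 2.0 * K + alpha
    gaps = []
    for i in range(steps + 1):
        w = -3.0 * c + 6.0 * c * i / steps
        for j in range(steps + 1):
            h = -2.0 * K + 4.0 * K * j / steps
            gaps.append(lower_bound_gap(w, h, K, alpha))
    return min(gaps)
```

When $|w| \le \alpha$ and $|h| \le 2K$, both $w$ and $w + h$ lie in the quadratic zone and the gap is exactly zero; otherwise the gap is nonnegative by convexity of $\phi$. Up to floating-point rounding, the minimum over the grid should therefore be zero.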

We have the following corollary of Theorem 1. 

Corollary 1. Let Assumptions^^ and\Qbe satisfied. If M{6*) ^ s, then, with 
probability at least 1 - M^ 1 - M~ K - 3M~ 2K log lo ™ M , we have 

8(1 + 2K) 2 2 2 2 

and 

/.*. 8(1 + 2A) A' 
3.1.1 Logistic regression and similar models

We consider $Z = (X, Y) \in \mathcal{X} \times \{0, 1\}$ where $\mathcal{X}$ is a Borel subset of $\mathbb{R}^d$. The
conditional probability $\mathbb{P}(Y = 1 \mid X = x) = \pi(x)$ is unknown, where $\pi$ is a
function on $\mathcal{X}$ with values in $[0, 1]$. We assume that $\pi$ is of the form

$$\pi(x) = \Phi'(f_{\theta^*}(x)), \qquad (22)$$

where the function $\Phi : \mathbb{R} \to \mathbb{R}_+$ is convex, twice differentiable, with derivative $\Phi'$
taking values in $[0, 1]$, and the vector $\theta^* \in \mathbb{R}^M$ is unknown. Consider, e.g., the
logit loss function $\Phi(u) = \log(1 + e^u)$. We assume that $\Phi$ is known. Define the
quantity

$$\tau(R) = \frac{1}{2} \inf_{|u| \le R} \Phi^{(2)}(u), \qquad (23)$$

for any $R > 0$. We want to estimate $\theta^*$ with the procedure (2) and the convex
loss function

$$Q(z, \theta) = -y f_\theta(x) + \Phi(f_\theta(x)), \qquad (24)$$

where $z = (x, y) \in \mathbb{R}^d \times \{0, 1\}$. Thus we need to check Assumption 3 to apply
Theorem 1.
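For the logit example, the curvature quantity has a closed form. The following sketch assumes that $\tau(R)$ is half the infimum of the second derivative of $\Phi$ over $[-R, R]$; for $\Phi(u) = \log(1 + e^u)$ the second derivative is $\sigma(u)(1 - \sigma(u))$ with $\sigma$ the logistic function, which is even and decreasing in $|u|$, so the infimum is attained at $|u| = R$. The function names are illustrative.

```python
import math

def sigmoid(u):
    """Phi'(u) for the logit case: the logistic function, with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))

def phi_second(u):
    """Second derivative of Phi(u) = log(1 + e^u)."""
    s = sigmoid(u)
    return s * (1.0 - s)

def tau(R):
    """Half the infimum of Phi'' over [-R, R]; attained at |u| = R for the logit case."""
    return 0.5 * phi_second(R)

def logistic_loss(y, prediction):
    """Q(z, theta) = -y f_theta(x) + Phi(f_theta(x)), with prediction = f_theta(x)."""
    return -y * prediction + math.log1p(math.exp(prediction))
```

Since $\tau(R)$ decays exponentially fast in $R$, the margin constant obtained from it degrades quickly as the bound $K$ on $\|f_\theta\|_\infty$ grows.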

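Lemma 5. Let the loss function be of the form (24) where $\Phi$ satisfies the above
assumptions. Then for any $\theta \in \mathbb{R}^M$ we have

$$\tau(K)\,\|f_\theta - f_{\theta^*}\|^2 \le \mathcal{E}(\theta).$$

Proof. Set $\Delta = \theta - \theta^*$. For any $\theta \in \Theta$, we have

$$Q(Z, \theta) - Q(Z, \theta^*) = \nabla Q(Z, \theta^*)^T (\theta - \theta^*) + \int_0^1 \Phi^{(2)}\big(f_{\theta^* + t(\theta - \theta^*)}(X)\big)(1 - t)\, dt \; f_\Delta(X)^2$$

$$\ge \nabla Q(Z, \theta^*)^T (\theta - \theta^*) + \tau\big(\|f_\theta\|_\infty \vee \|f_{\theta^*}\|_\infty\big)\, f_\Delta(X)^2.$$

Since $\|\nabla Q(\cdot, \cdot)\|_\infty < \infty$, we can differentiate under the expectation sign, so that

$$\mathbb{E}\big(\nabla Q(Z, \theta^*)^T (\theta - \theta^*)\big) = \nabla R(\theta^*)^T (\theta - \theta^*) = 0.$$

Thus

$$\mathcal{E}(\theta) \ge \tau\big(\|f_\theta\|_\infty \vee \|f_{\theta^*}\|_\infty\big)\,\|f_\theta - f_{\theta^*}\|^2.$$

□

Thus Assumption 3 is satisfied with the constants $\kappa = 2$ and $c = 1/\sqrt{\tau(K)}$.
We have the following corollary of Theorem 1.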



Corollary 2. Let Assumptions^^ and^be satisfied. Lf M{9*) ^ s, then, with 

n 

log AS '■ 



probability at least 1 — M 1 — M K — 3M 2K log , ", /f , we have 



4(1 + 2X) 2 _ 2 2 2 



sup v ; .sr 2 + -r 
and

„„ 4(1 + 2K) K 
sup 0-0* i ^ )-^—, i ff- 1 



-(X)C(s) 2 3(1 + 210 



4 Sup-norm convergence rate for the regression 
model with the Huber loss 

In this section, we derive the sup-norm convergence rate of the Dantzig
estimators to the target vector $\theta^*$ in the linear regression model under a mutual
coherence assumption on the dictionary and Huber's loss. The proof relies on
the fact that the Hessian matrix of the risk also satisfies the mutual coherence
condition for this particular model. Unfortunately, we cannot proceed similarly
in the general case because the Hessian matrix of the risk at point $\theta^*$ does not
necessarily satisfy the mutual coherence condition even if the Gram matrix of
the dictionary does. Note that for Huber's loss the Dantzig minimization problem
(2) is computationally feasible. The constraints in (2) are indeed linear, so that
(2) is a linear programming problem.

Denote by $\Psi(\theta)$ the Hessian matrix of the risk $R$ evaluated at $\theta$. With our
assumptions on the dictionary $\mathcal{D}$ and on the function $\gamma$, for any $\theta \in \mathbb{R}^M$ we
have

$$\Psi(\theta) = \nabla^2 R(\theta) = \Big(\mathbb{E}\big(f_j(X) f_k(X)\, \partial_u^2 \gamma(Y, f_\theta(X))\big)\Big)_{1 \le j,k \le M}.$$

Note that for the quadratic loss we have $\Psi(\cdot) = 2G$ where $G$ is the Gram matrix
of the design. For Lipschitz loss functions the Hessian matrix $\Psi$ varies with $\theta$.

We consider the linear regression model (20). For any functions $g, h : \mathcal{X} \to \mathbb{R}$,
denote by $\langle g, h \rangle$ the scalar product $\mathbb{E}(g(X)h(X))$. Define the Gram matrix
$G$ by

$$G = (\langle f_j, f_k \rangle)_{1 \le j,k \le M}.$$

From now on, we assume that $G$ satisfies a mutual coherence condition.

Assumption 5. The Gram matrix $G = (\langle f_j, f_k \rangle)_{1 \le j,k \le M}$ satisfies

$$G_{jj} = 1, \quad \forall\, 1 \le j \le M,$$

and

where $s \ge 1$ is an integer and $\beta > 1$ is a constant.
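A mutual coherence condition bounds the largest off-diagonal entry of the Gram matrix. As an illustration only, the sketch below computes this quantity for an empirical Gram matrix built from a design array; using the empirical matrix $(1/n) F^T F$ in place of the population matrix $G$, the layout `F[i][j]` $= f_j(X_i)$, and the function names are all assumptions made for the example.

```python
def empirical_gram(F):
    """Empirical Gram matrix (1/n) F^T F, where F[i][j] = f_j(x_i)."""
    n, M = len(F), len(F[0])
    return [[sum(F[i][j] * F[i][k] for i in range(n)) / n for k in range(M)]
            for j in range(M)]

def mutual_coherence(G):
    """Largest absolute off-diagonal entry, max over j != k of |G_jk| (needs M >= 2)."""
    M = len(G)
    return max(abs(G[j][k]) for j in range(M) for k in range(M) if j != k)
```

An orthonormal design gives coherence 0, while two identical columns give coherence equal to the common diagonal value. The assumption asks for a coherence that is small relative to $1/s$.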

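This assumption is stronger than Assumption 1. We have indeed the following
lemma (cf. Lemma 2 in [15]).

Lemma 6. Let Assumption 5 be satisfied. Then

$$\zeta(s) = \min_{J \subset \{1,\dots,M\},\, |J| \le s} \;\; \min_{\Delta \ne 0:\, |\Delta_{J^c}|_1 \le |\Delta_J|_1} \frac{\|f_\Delta\|}{|\Delta_J|_2} \ge \sqrt{1 - \frac{1}{\beta}} > 0.$$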
Note that under Assumption 5 the vector $\theta^*$ satisfying (20) such that $M(\theta^*) \le s$
is unique. Consider indeed two vectors $\theta^1$ and $\theta^2$ satisfying (20) such that
$M(\theta^1) \le s$ and $M(\theta^2) \le s$. Denote $\theta = \theta^1 - \theta^2$ and $J = J(\theta^1) \cup J(\theta^2)$. Clearly
we have $f_\theta(X) = 0$ a.s. and $M(\theta) \le 2s$. Assume that $\theta^1$ and $\theta^2$ are distinct.
Then,

2 



\fo\\i 1 , vig-Im) 



\e\l \o\l 



M 

> 1 - 



— T 



* 1 "^ >0 ' 

where we have used the Cauchy-Schwarz inequality. This contradicts the fact
that $f_\theta(X) = 0$ a.s.

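For the linear regression model, the Hessian matrix $\Psi$ at point $\theta$ is

$$\Psi(\theta) = \Big(\mathbb{E}\big(\mathbf{1}_{\{|f_{\theta^* - \theta}(X) + W| \le 2K+\alpha\}}\, f_j(X) f_k(X)\big)\Big)_{1 \le j,k \le M}.$$

We observe that

$$\Psi(\theta^*) = \mathbb{P}(|W| \le 2K + \alpha)\, G.$$

Thus $\Psi(\theta^*)$ satisfies a condition similar to Assumption 5 but with a different
constant if $\mathbb{P}(|W| \le 2K + \alpha) > 0$. The empirical Hessian matrix $\hat\Psi$ at point
$\theta \in \mathbb{R}^M$ is defined by

$$\hat\Psi_{j,k}(\theta) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{\{|f_{\theta^* - \theta}(X_i) + W_i| \le 2K+\alpha\}}\, f_j(X_i) f_k(X_i), \qquad 1 \le j, k \le M.$$

We will prove that the empirical Hessian matrix $\hat\Psi(\theta)$ satisfies a mutual
coherence condition for any $\theta$ in a small neighborhood of $\theta^*$ under some additional
assumptions given below.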
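The empirical Hessian is straightforward to compute from data, because in model (20) the quantity $f_{\theta^* - \theta}(X_i) + W_i$ is the observable residual $Y_i - f_\theta(X_i)$. The sketch below is illustrative: `F[i][j]` $= f_j(X_i)$, `c` stands for $2K + \alpha$, and the function names are hypothetical.

```python
def empirical_hessian(theta, F, y, c):
    """(1/n) sum_i 1{|y_i - f_theta(x_i)| <= c} f_j(x_i) f_k(x_i), as an M x M list."""
    n, M = len(F), len(F[0])
    H = [[0.0] * M for _ in range(M)]
    for feats, yi in zip(F, y):
        residual = yi - sum(t * f for t, f in zip(theta, feats))
        if abs(residual) <= c:  # only observations in the quadratic zone contribute
            for j in range(M):
                for k in range(M):
                    H[j][k] += feats[j] * feats[k] / n
    return H

def min_diagonal(H):
    """Smallest diagonal entry of H."""
    return min(H[j][j] for j in range(len(H)))

def max_off_diagonal(H):
    """Largest absolute off-diagonal entry of H (needs M >= 2)."""
    M = len(H)
    return max(abs(H[j][k]) for j in range(M) for k in range(M) if j != k)
```

Observations with large residuals are dropped from the sum, so the empirical Hessian behaves like a Gram matrix computed on the subsample of small residuals, rescaled by the full sample size $n$.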

First, we need an additional mild assumption on the noise.

Assumption 6. The c.d.f. $F_W$ of $W$ is Lipschitz continuous.

This assumption is satisfied, e.g., if $W$ admits a bounded density, so we allow
heavy-tailed distributions such as the Cauchy. In the sequel we assume w.l.o.g.
that the Lipschitz constant of $F_W$ equals 1.

We impose a restriction on the sparsity s. 

Assumption 7. The sparsity s satisfies s ^ 

This implies that we can recover the sparse vectors with at most $O\big((n/\log M)^{1/4}\big)$
nonzero components.

Define $V_\eta = \{\theta \in \Theta : |\theta - \theta^*|_1 \le \eta\}$ where $\eta = C_1 r s$ and

$$C_1 = \frac{8(1 + 2K)\beta}{\mathbb{P}(|W| \le \alpha)(\beta - 1)} + \frac{1}{6}. \qquad (25)$$
Consider the event

E = 
where 



sup 

i^j,k^M,eev v 



< 8L S T] + ALf 



Co 



(26) 



C 2 = 2,/l + (l + L 2 ) (8CiL 3 + — 



1 + L 2 



We have the following intermediate result. 



Lemma 7. Let Assumptions\^-\^be satisfied. Thenf(E) > 1 — exp(— y/logM). 
Proof. Define the variable 



Z = sup 



Applying the Bousquet concentration inequality (cf. Theorem 4 in Section 6) 
with the constants c = (1 + L 2 )/n, a 2 = 2/n 2 and x = -^r yields that, with 
probability at least 1 — e~ x , 



Z < E (Z) + -t^0 + (1 + i 2 )E (Z) + 



1 + L 2 
3y/ns 2 



(27) 



We study now the quantity E(Z). A standard symmetrization and contraction 
argument yields 



1{Z) ^ 2E sup 



1 n \ 

-^2 e i 1 \h*- l) (X l )+W 7 .\^2K+ a fi( X i)fk(X i ) 
i=l ) 



s$ 2E 



\ n 



2E 



sup 



1 - \ 

-2je»(I|/ fl ,_ 9 (X i )+W i |<2ii'+Q - ft\Wi\^2K+a)fj(Xi)fk(Xi) J 
71 i=l / 



(28) 



Denote by (I) and (LI) respectively the first term and the second term on the 
right hand side of the above expression. The contraction principle yields 



(I) < 4E max 



1 - 

-Y,tifAXi)fk{Xi) 



(29) 



Then, similarly as in the proof of Lemma 2 we get 



E max 



1 " 

-£ei/i(*i)A(*< 



< Lf. 
Thus, for (II) we have

(II) ^ 2L 2 E I SUp — \^\f & (Xi)+Wi\^2K+a - I|Wi|<2if+a| 



( 1 " 

\eev n n ^ 



2L 2 ¥{2K + a-Li 1 ^\W\^2K + a + Li]) 

8L 3 r?. (30) 

/ \ 1/4 

Assumptions Hand [7] yield that s < ( j^jj . Combining (|27 | -l(30 | yields the 



, log M 

result. □ 

We need an additional technical assumption. 
Assumption 8. We have \2I?r] + Lr + ^ n\w\^2K+ a ) _ 

This is a mild assumption. It is indeed satisfied for $n$ large enough if we
assume that $\mathbb{P}(|W| \le 2K + \alpha) > 0$ since Assumption 6 implies that $r \to 0$ as
$n \to \infty$.

We have the following result on the empirical Hessian matrix. 



Lemma 8. Let Assumptions \M8\ be satisfied. Then, with probability at least 
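$1 - \exp(-\sqrt{\log M})$, for any $\theta \in V_\eta$, we have

$$\min_{1 \le j \le M} \hat\Psi_{j,j}(\theta) \ge \frac{\mathbb{P}(|W| \le 2K + \alpha)}{2}, \qquad \max_{j \ne k} |\hat\Psi_{j,k}(\theta)| \le \frac{C_3}{s}, \qquad (31)$$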
w/iere C 3 = A + 12L 3 C*i + 

° op - 1 - y/ns 

Proof. For any # in V v and any j, k in {1, ... , M } we have 

*i,k(0) - *i,fc(fl*) = E ((l|/ A (X)+W|<2if+a - I|M/K2if+a)/,(A)/ fe (A)) , 

where A = 9 - 9* . Then 

- < ^ (|I|/AW+W|<2K-+a " VlOT-al) 

< L 2 P(\W\s:2K + a,\f A (X) + W\>2K + a) 
+ L 2 P (\W\ >2K + a, |/ A (A) + W\ < 2A + a) . 

Recall that |/a(A)| L77. Then 

l*j,feW - I < i 2 P(2A + a - L77 |W| 2A + Q- + L77) 

sC 2L 2 P (2A + a-L?7^iy^2A + Q! + L? ? ) 
«S 4L\ (32) 
where we have used the fact that the distribution of $W$ is symmetric w.r.t. 0
in the second line and Assumption 6 in the last line. Lemma 7 and (32) yield
that, on the event $E$, for any $\theta \in V_\eta$,

C 2 



mm 



and 



> H\W\ < 2K + a) - Y1LS - 



max I*,-,* «S — . 

j^k S 



□ 



Now we can derive the optimal sup-norm convergence rate of the Dantzig 
estimators. 

Theorem 2. Let Assumptions\MEbe satisfied. If M{6*) ^ s, then, on an event 
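of probability at least $1 - M^{-1} - M^{-K} - \exp(-\sqrt{\log M}) - 3M^{-2K}\log\frac{n}{\log M}$, we
have

$$\sup_{\hat\theta \in \hat\Theta} |\hat\theta - \theta^*|_\infty \le C_4 r,$$

where $r$ is defined in Theorem 1,

$$C_4 = \frac{4 + 2C_1 C_3}{\mathbb{P}(|W| \le 2K + \alpha)},$$

with $C_1$ and $C_3$ defined respectively in (25) and Lemma 8.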

Proof. For any in we have 



S7R n (6)-VRr, 



#(0* + iA)dt 



L/0 



A, 



where A = -0*. 

The definition of our estimator, Lemma[3]and Corollary Q] yield that, on an 
event A\ of probability at least 1 - M _1 - exp(-VlogM) - 3M~ 2K log j-j^, 



we have that €V V and 



A 



< 2r. 



Lemma [8] yields that, on the event n £ 
P(|VF| 2K + a) 



AL 2r- 



C 3 



|A|i, 



so that 



□ 



Note that Theorem 2 holds true for the Lasso estimators (3) with exactly the
same proof, provided that a result similar to Theorem 1 is valid for the Lasso
estimators. This is in fact the case (cf. [22], [12]).
5 Sign concentration property

Now we study the sign concentration property of the Dantzig estimators. We
need an additional assumption on the magnitude of the nonzero components of
$\theta^*$.

Assumption 9. We have

$$\rho = \min_{j \in J(\theta^*)} |\theta^*_j| > 2C_4 r,$$

where $r$ is defined in Theorem 1 and $C_4$ is defined in Theorem 2.

We can find similar assumptions on p in the work on sign consistency of the 
Lasso estimator mentioned above. More precisely, the lower bound on p is of the 
order (s(logM) /n) 1 / 4 in [17], n~ s / 2 with < 6 < 1 in [H E3, y/ {log Mn)/n 
in [3], y/s{log M)/n in [26] and r in [15]. 

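We introduce the following thresholded version of our estimator. For any
$\hat\theta \in \hat\Theta$ the associated thresholded estimator $\tilde\theta \in \mathbb{R}^M$ is defined by

$$\tilde\theta_j = \begin{cases} \hat\theta_j & \text{if } |\hat\theta_j| > C_4 r, \\ 0 & \text{elsewhere.} \end{cases}$$

Denote by $\tilde\Theta$ the set of all such $\tilde\theta$. We have first the following non-asymptotic
result that we call sign concentration property.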
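The thresholding step itself is elementary. A minimal sketch follows, in which `level` stands for the threshold $C_4 r$ and the function names are illustrative.

```python
def threshold(theta_hat, level):
    """Keep the coordinates with |theta_hat_j| > level and set the others to zero."""
    return [t if abs(t) > level else 0.0 for t in theta_hat]

def sign_vector(theta):
    """Coordinate-wise sign in {-1, 0, 1}."""
    return [(t > 0) - (t < 0) for t in theta]
```

For example, with level 0.1 the vector (0.05, -1.2, 0.3, 0) is thresholded to (0, -1.2, 0.3, 0), whose sign vector is (0, -1, 1, 0). Small spurious coordinates are removed while the signs of the large ones are preserved.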

Theorem 3. Let Assumptions^ and\MM be satisfied. If M{9* ) ^ s, then 
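$$\mathbb{P}\big(\operatorname{sign}(\tilde\theta) = \operatorname{sign}(\theta^*),\ \forall \tilde\theta \in \tilde\Theta\big) \ge 1 - M^{-1} - M^{-K} - \exp(-\sqrt{\log M}) - 3M^{-2K}\log\frac{n}{\log M}.$$

Theorem 3 guarantees that the sign vector of every vector $\tilde\theta \in \tilde\Theta$ coincides
with that of $\theta^*$ with probability close to one.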

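Proof. Theorem 2 yields $\sup_{\hat\theta \in \hat\Theta} |\hat\theta - \theta^*|_\infty \le C_4 r$ on an event $A$ of probability at
least $1 - 6M^{-1}$. Take $\hat\theta \in \hat\Theta$. For $j \in J(\theta^*)^c$, we have $\theta^*_j = 0$, and $|\hat\theta_j| \le C_4 r$ on
$A$. For $j \in J(\theta^*)$, we have $|\theta^*_j| \ge 2C_4 r$ and $|\theta^*_j| - |\hat\theta_j| \le |\theta^*_j - \hat\theta_j| \le C_4 r$ on $A$.
Since we assume that $\rho > 2C_4 r$, we have on $A$ that $|\hat\theta_j| > C_4 r$. Thus on the event
$A$ we have: $j \in J(\theta^*) \Leftrightarrow |\hat\theta_j| > C_4 r$. This yields $\operatorname{sign}(\tilde\theta_j) = \operatorname{sign}(\hat\theta_j) = \operatorname{sign}(\theta^*_j)$
if $j \in J(\theta^*)$ on the event $A$. If $j \notin J(\theta^*)$, $\operatorname{sign}(\theta^*_j) = 0$ and $\tilde\theta_j = 0$ on $A$, so
that $\operatorname{sign}(\tilde\theta_j) = 0$. The same reasoning holds true simultaneously for all $\hat\theta \in \hat\Theta$
on the event $A$. Thus, we get the result. □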
6 Appendix

We recall here some well-known results of the theory of empirical processes.

Theorem 4 (Bousquet's version of Talagrand's concentration inequality [2]).
Let $X_i$ be independent variables in $\mathcal{X}$ distributed according to $P$, and $\mathcal{F}$ be a
set of functions from $\mathcal{X}$ to $\mathbb{R}$ such that $\mathbb{E}(f(X)) = 0$, $\|f\|_\infty \le c$ and $\|f\|^2 \le \sigma^2$
for any $f \in \mathcal{F}$. Let $Z = \sup_{f \in \mathcal{F}} \sum_{i=1}^n f(X_i)$. Then with probability at least $1 - e^{-x}$, it
holds that

$$Z \le \mathbb{E}(Z) + \sqrt{2x\,(n\sigma^2 + 2c\,\mathbb{E}(Z))} + \frac{cx}{3}.$$
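To see how the deviation bound scales, it can be evaluated numerically. The sketch below assumes the standard form of Bousquet's inequality, with deviation terms $\sqrt{2x(n\sigma^2 + 2c\,\mathbb{E}(Z))}$ and $cx/3$; the function name and argument names are illustrative.

```python
import math

def bousquet_upper_bound(expected_sup, x, n, sigma2, c):
    """Upper bound on Z holding with probability at least 1 - exp(-x).

    expected_sup is E(Z), sigma2 bounds the variance of each f(X),
    and c bounds the sup-norm of each f.
    """
    variance_term = math.sqrt(2.0 * x * (n * sigma2 + 2.0 * c * expected_sup))
    return expected_sup + variance_term + c * x / 3.0
```

The bound combines a sub-Gaussian term of order $\sqrt{x}$ and a sub-exponential term of order $x$. In the proofs, $x$ is taken of order $\log M$, which produces the polynomial probability bounds in $M$.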



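Theorem 5 (Symmetrization theorem [24], p. 108). Let $X_1, \dots, X_n$ be independent
random variables with values in $\mathcal{X}$, and let $\epsilon_1, \dots, \epsilon_n$ be a Rademacher
sequence independent of $X_1, \dots, X_n$. Let $\mathcal{F}$ be a class of real-valued functions
on $\mathcal{X}$. Then

$$\mathbb{E} \sup_{f \in \mathcal{F}} \Big|\sum_{i=1}^n \big(f(X_i) - \mathbb{E} f(X_i)\big)\Big| \le 2\,\mathbb{E} \sup_{f \in \mathcal{F}} \Big|\sum_{i=1}^n \epsilon_i f(X_i)\Big|.$$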


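Theorem 6 (Contraction theorem [14], p. 95). Let $x_1, \dots, x_n$ be nonrandom
elements of $\mathcal{X}$, and let $\mathcal{F}$ be a class of real-valued functions on $\mathcal{X}$. Consider
Lipschitz functions $\gamma_i : \mathbb{R} \to \mathbb{R}$, that is,

$$|\gamma_i(s) - \gamma_i(s')| \le |s - s'|, \quad \forall s, s' \in \mathbb{R}.$$

Let $\epsilon_1, \dots, \epsilon_n$ be a Rademacher sequence. Then for any function $f^* : \mathcal{X} \to \mathbb{R}$,
we have

$$\mathbb{E} \sup_{f \in \mathcal{F}} \Big|\sum_{i=1}^n \epsilon_i \big(\gamma_i(f(x_i)) - \gamma_i(f^*(x_i))\big)\Big| \le 2\,\mathbb{E} \sup_{f \in \mathcal{F}} \Big|\sum_{i=1}^n \epsilon_i \big(f(x_i) - f^*(x_i)\big)\Big|.$$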